Reference kernel for 3D convolution for non-packed tensors #2334
I think it's better to make all of the indexing types uint64_t and adjust all of the intermediate calculations to reflect this change.
It's a generic naive kernel which should be applicable to almost all the possible cases including huge tensors.
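To illustrate the concern about huge tensors, here is a minimal sketch comparing a linear offset computed in 64-bit and in 32-bit unsigned arithmetic. The helper names `Offset64` and `Offset32` are hypothetical and are not taken from the PR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: linear element offset from per-dimension indices and
// strides, computed entirely in 64-bit unsigned arithmetic.
template <std::size_t N>
std::uint64_t Offset64(const std::array<std::uint64_t, N>& idx,
                       const std::array<std::uint64_t, N>& strides)
{
    std::uint64_t off = 0;
    for(std::size_t i = 0; i < N; ++i)
        off += idx[i] * strides[i];
    return off;
}

// The same computation with every operand truncated to 32 bits, which is
// what a 32-bit index type would do. Unsigned overflow wraps around silently.
template <std::size_t N>
std::uint32_t Offset32(const std::array<std::uint64_t, N>& idx,
                       const std::array<std::uint64_t, N>& strides)
{
    std::uint32_t off = 0;
    for(std::size_t i = 0; i < N; ++i)
        off += static_cast<std::uint32_t>(idx[i]) * static_cast<std::uint32_t>(strides[i]);
    return off;
}
```

With an outermost stride of 2^31 elements and an index of 2 in that dimension, the 64-bit version yields 2^32 while the 32-bit version wraps to 0, so the kernel would silently address the wrong element.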
@CAHEK7 : thank you for the comments. I agree with possibly all of them. Some of the stuff, like 32-bit indexes, is from earlier, and I think we should promote them all to 64-bit unsigned. Will discuss it with the team. Typically we use 32-bit for performance reasons, but that's not critical here.
@iq136boy Can you please take some time to review this kernel?
@amberhassaan could you resolve this issue?
@junliume : yes, on it. I was debugging some deeper issues with hipRTC kernel launch, so have been ignoring this warning.
Please use |
@amberhassaan could you please resolve the conflicts by merging the latest develop branch into your feature branch?
The code quality should be improved prior to merging.
result.invoker_factory = [=](const std::vector<Kernel>& kernels) {
    const auto kern = kernels[0];
    return [=](const Handle& handle, const AnyInvokeParams& primitive_parameters) {
        decltype(auto) data_ctx = primitive_parameters.CastTo<conv::DataInvokeParams>();
        const auto& tensors = data_ctx.tensors;
        float elapsed = 0;

        auto in_strides = MakeStrideArray<6>(
[Perf] #2334 (comment)
[resolved]
result.invoker_factory = [=](const std::vector<Kernel>& kernels) {
    const auto kern = kernels[0];
    return [=](const Handle& handle, const AnyInvokeParams& primitive_parameters) {
        decltype(auto) data_ctx = primitive_parameters.CastTo<conv::WrWInvokeParams>();
        const auto& tensors = data_ctx.tensors;
        float elapsed = 0;

        auto in_strides = MakeStrideArray<5>(
[Perf] #2334 (comment)
[resolved]
result.invoker_factory = [=](const std::vector<Kernel>& kernels) {
    const auto kern = kernels[0];
    return [=](const Handle& handle, const AnyInvokeParams& primitive_parameters) {
        decltype(auto) data_ctx = primitive_parameters.CastTo<conv::WrWInvokeParams>();
        const auto& tensors = data_ctx.tensors;
        float elapsed = 0;

        auto in_strides = MakeStrideArray<6>(
[Perf] #2334 (comment)
[resolved]
kernel_name << "naive_conv_";
if(miopen::IsEnvvarValueEnabled("MIOPEN_USE_PACKED_CONV_REF_KERNEL"))
{
    kernel_name << "naive_conv_packed_";
[Q] The packed kernels are not used unless the environment variable is set. This in fact means that this PR contains some dead code which is never run at users' sites. Why is this necessary?
The packed kernels are not tested either.
Fair point. I want to keep the packed variants of the kernels for debugging purposes, as we found that debugging becomes easier if I can use the packed variant with an env variable and quickly establish that the non-packed variant, or how it is invoked, is the source of the error.
Here's my suggestion. I will remove the packed variants and this env variable in a follow-up PR. For this PR, I can put the if(env var enabled) part in a #ifndef NDEBUG so that it is enabled in debug mode only.
To add: the use of strides seems to be new in MIOpen. Many kernels still don't support strides, and I have found that tests often don't set up the strides correctly, so the env variable helps ease debugging such cases.
I have found that often tests don't setup the strides correctly
This shouldn't be so. Please include fixes for such cases or open a ticket at least. Thanks.
@averinevg Please notice this review thread.
Here's my suggestion. I will remove the packed variants and this env variable in a follow up PR. For this PR, I can put the if(env var enabled) part in a #ifndef NDEBUG so that it is enabled in debug mode only.
Maybe we can use "old good" packed variants for convolutions that use packed tensors? Because they are well-tested and may give better performance? What do you think?
That is a possibility; however, the change is more involved (it needs logic inside the Invoker lambda to test whether tensors are packed or non-packed and then call one of two kernels), and I am not sure how much performance benefit we would get (if you look at the kernel code, we replace some integer multiplications with access to kernel parameters, i.e. the strides, which may be cached and cheap to access).
need logic inside Invoker lambda to test tensors for being packed and non-packed and then call one of two kernels
Of course we do not want any additional logic in the invoker (and actually more because in this case we have to build both kernels even if only one will be used). But we can perform the test and select proper kernel outside of InvokerFactory, right in GetSolution(), because packed and non-packed convolution problems will have different Invokers.
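The idea of choosing the kernel once, outside the invoker, can be sketched as follows. `ToySolution`, `GetToySolution` and both kernel names are hypothetical stand-ins for this illustration and do not reflect the actual MIOpen types.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a solution object.
struct ToySolution
{
    std::vector<std::string> kernels_to_build; // only the selected kernel
    std::function<std::string()> invoker;      // "launches" kernels_to_build[0]
};

inline ToySolution GetToySolution(bool all_tensors_packed)
{
    ToySolution s;
    // The choice is made once, here, so the invoker carries no branching and
    // the unused kernel is never built. Kernel names are assumptions.
    const std::string name = all_tensors_packed ? "naive_conv_packed" : "naive_conv_nonpacked";
    s.kernels_to_build.push_back(name);
    s.invoker = [name]() { return name; };
    return s;
}
```

Because packed and non-packed problems receive different solutions, each invoker only ever refers to the single kernel that was built for it.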
The next commit will put the packed name + env variable check in #ifndef NDEBUG, and a future PR will remove the packed kernels altogether. (I haven't pushed it yet because I want the CI to run fully.)
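A minimal sketch of the debug-only guard described here, assuming a simple "1 means enabled" convention. `EnvEnabled` and `KernelNamePrefix` are hypothetical helpers written for this illustration; the library's own `IsEnvvarValueEnabled` may accept other spellings.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical stand-in for an "is this env variable enabled" check.
inline bool EnvEnabled(const char* name)
{
    const char* v = std::getenv(name);
    return v != nullptr && std::string(v) == "1";
}

// Builds the kernel-name prefix. The packed variant can only be selected in
// debug builds (NDEBUG not defined); release builds never read the variable.
inline std::string KernelNamePrefix()
{
    std::ostringstream kernel_name;
#ifndef NDEBUG
    if(EnvEnabled("MIOPEN_USE_PACKED_CONV_REF_KERNEL"))
    {
        kernel_name << "naive_conv_packed_";
        return kernel_name.str();
    }
#endif
    kernel_name << "naive_conv_";
    return kernel_name.str();
}
```

In a release build the preprocessor drops the whole `if` block, so the packed path cannot be reached at users' sites.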
If you would like to get rid of packed kernels, then please make sure that the non-packed kernels are thoroughly tested (several thousands of configs) with packed convolutions. Why: #2334 (comment)
@JehandadKhan Do we care about performance of naive kernels?
template <>
miopenDataType_t GetDataType<float>()
{
    return miopenFloat;
All tests: The "float" tests should only be run if MIOpen is built with the "-DMIOPEN_TEST_ALL=On" cmake option and none of the "-DMIOPEN_TEST_HALF/INT8/BFLOAT16/DOUBLE=On" options specified (i.e. when we want to test the float datatype).
Similarly, the "half" tests should only be run if MIOpen is built with the "-DMIOPEN_TEST_ALL=On" and "-DMIOPEN_TEST_HALF=On" etc.
Example implementation of similar stuff: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/f385d8baaddc9445d8559bccc1fa34266e289e95/test/gtest/conv_igemm_mlir_xdlops.cpp#L92
Please also take this into account.
I don't see how this comment is applicable to my PR. My PR deals with reference kernel tests and does not touch CK kernel tests, apart from the issue where weight dimensions were not set up correctly (hence the change).
[Notice] Please note that the "reference" kernels are used not only for testing. These are the "last chance" solutions for the cases when MIOpen is unable to find more performant solvers.
Let me ask a couple of questions that may help us to resolve this comment:
- Do the new tests always validate all datatypes?
- Do you know that MIOpen built with -DMIOPEN_TEST_HALF=On will test only the FP16 datatype?
By the way, do the new tests use CPU to calculate the reference output? If not, then the new kernels are testing themselves ;)
@atamazov :
Question1: AFAICT, reference kernel supports all data types. See last few lines of test/gpu_reference_kernel.cpp.
Question2: I don't know. The way the test/gpu_reference_kernel is written, it doesn't seem to care.
Question3: We test output against CPU reference, so no testing against self :).
Question1: AFAICT, reference kernel supports all data types. See last few lines of test/gpu_reference_kernel.cpp.
If the new tests always validate all datatypes, then it should be changed. The tests should observe some environment variables and skip all datatypes except one. The idea is to not spend testing time for, say, FP16 and BF16 tests when only FP32 testing is asked. Some hints are provided above at #2334 (comment).
Question2: I don't know. The way the test/gpu_reference_kernel is written, it doesn't seem to care.
Yep. That's why I am asking for changes.
Question3: We test output against CPU reference, so no testing against self :).
Good.
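The requested behaviour (skip every datatype except the one being tested) could look roughly like the sketch below. Both environment variable names are hypothetical and chosen for this illustration only; they are not the variables MIOpen's test harness actually reads, and "float" as the default datatype is likewise an assumption.

```cpp
#include <cstdlib>
#include <string>

// Pure decision function: run a test for `test_type` only when the full test
// set is requested and the requested datatype matches.
inline bool ShouldRun(const std::string& test_type, bool test_all, const std::string& requested_type)
{
    return test_all && test_type == requested_type;
}

// Reads the request from the environment. SKETCH_TEST_ALL and
// SKETCH_TEST_DATATYPE are hypothetical names used only in this sketch.
inline bool ShouldRunFromEnv(const std::string& test_type)
{
    const char* all  = std::getenv("SKETCH_TEST_ALL");
    const char* type = std::getenv("SKETCH_TEST_DATATYPE");
    return ShouldRun(test_type,
                     all != nullptr && std::string(all) == "1",
                     type != nullptr ? type : "float");
}
```

A test body would call `ShouldRunFromEnv("half")` first and skip itself when the result is false, so no time is spent on FP16 or BF16 when only FP32 testing is requested.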
// * strides here are in the little-endian order, i.e., for NHWC, stride for N is
// at index 3 while stride for C is at index 0. This is reverse of how strides are
// stored in tensor descriptors, which are big-endian.
[R] The term "endianness" is not applicable here. You can simply say that strides are in NHWC layout.
I respectfully disagree. Endianness matters. I can store the strides for NHWC in an array left-to-right or right-to-left. Currently MIOpen makes an even weirder choice of storing all strides in NCHW order left-to-right. We should discuss this in the near future.
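The two storage orders under discussion can be made concrete with a small sketch. `ReverseStrides` is a hypothetical helper written for this illustration (the real `MakeStrideArray` signature is not shown in this thread); it assumes the input lists strides outermost dimension first.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: copies strides given outermost-dimension-first into a
// fixed-size array ordered innermost-dimension-first. Unused trailing slots
// (when fewer than N strides are given) are left as zero.
template <std::size_t N>
std::array<std::uint64_t, N> ReverseStrides(const std::vector<std::uint64_t>& strides)
{
    std::array<std::uint64_t, N> out{};
    const std::size_t n = strides.size() < N ? strides.size() : N;
    for(std::size_t i = 0; i < n; ++i)
        out[i] = strides[strides.size() - 1 - i];
    return out;
}
```

For a packed NHWC tensor with N=2, H=3, W=4, C=5, the strides listed as N, H, W, C are {60, 20, 5, 1}; after reversal the stride for C sits at index 0 and the stride for N at index 3, which matches the ordering described in the code comment under review.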
@amberhassaan Then I recommend to invent a new word. "Weirdness" seems to be a good candidate ;) In the computing world, endianness has very specific meaning.
Currently MIOpen makes an even weirder choice of storing all strides in NCHW order left-to-right.
This is sad. Fortunately, I am not involved in such design. I don’t have time to review everything. Although this is for good - I don’t want to end my days in Bedlam ;)
@averinevg Can you please look at this review thread as well.
@amberhassaan The concept of endianness applies to byte order in integers. The order of variables in a sequence is ... just the order of variables in a sequence 😄
The concept I'm trying to describe is very very close to endianness and I believe it's appropriate to borrow this term. See for example: https://hackage.haskell.org/package/polynomial-0.5/docs/Math-Polynomial.html The question in this link is how to store polynomials, and I think big-endian vs little-endian conveys the concept clearly as opposed to saying "left-to-right" or "right-to-left" because there is no left or right in memory and the terms left or right are not clear.
@amberhassaan The term endianness has a very specific definition in the technical literature. Any developer can contribute to this library, and reading your comment they will understand it not quite the way you intended. Therefore, in order not to cause confusion, let's use generally accepted terms in accordance with their meaning. I think that references to the terms big endian and little endian can be removed from your comment; the meaning will not be affected and there will be no misunderstandings.
The next commit will remove the term endianness.
@amberhassaan [notice] I think you meant next PR.
@@ -47,29 +48,20 @@ inline HipEventPtr make_hip_event()

#if 1 // Keep around other storage techinques -- @pfultz2 27.03.2017
[R] I am not sure that we want to remove this. In my previous PR I decided to keep this as an illustration of the original idea.
The original had a bug on line 54. The new code is the same as original except that line 54 is fixed. I don't think there's any benefit in keeping buggy code (rather may cause harm if someone assumes that it works).
@amberhassaan I mean keeping the following code:
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/6d539ee81321121570606e4ef62e6d072775bbd8/src/include/miopen/hipoc_kernel.hpp#L65-L71
which was never used but explains the original idea.
I don't see any value in keeping this code. It's just a simple pair class. We have always needed proper alignment for kernel arguments, so keeping around code that doesn't address alignment issues is just misleading and confusing.
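For context on the alignment point, here is a minimal sketch of packing arguments into a byte buffer so that each one starts at a multiple of its own alignment. `PushAligned` is a hypothetical helper for illustration and is not the code from hipoc_kernel.hpp.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical sketch: appends one trivially copyable argument to a byte
// buffer, first padding the buffer so the argument starts at a multiple of
// its alignment. Padding bytes are zero because resize() value-initializes.
template <class T>
void PushAligned(std::vector<char>& buf, const T& value)
{
    static_assert(std::is_trivially_copyable<T>::value, "argument must be trivially copyable");
    const std::size_t a      = alignof(T);
    const std::size_t offset = (buf.size() + a - 1) / a * a; // round up to alignment
    buf.resize(offset + sizeof(T));
    std::memcpy(buf.data() + offset, &value, sizeof(T));
}
```

Pushing a `char` followed by a `long long` places the second argument at offset `alignof(long long)` rather than at offset 1, which is the kind of padding a plain pair of members stored back to back in a byte stream would not provide.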
Let's clean up the code; if we ever need it, I am sure it's not too difficult to write it again.
@amberhassaan This comment is just a recommendation. You are free to either follow recommendations or deny them, no explanations needed.
so keeping around code that doesn't address alignment issues is just misguiding and confusing.
Thanks for positive feedback.
@amberhassaan When one or more tensors are non-packed we have to add stride information to the serialized convolution problem config (which is used as a key in some persistent databases etc). If all tensors are fully packed, then the serialized convolution problem should remain intact. This is necessary for backward compatibility.
The above means that the stride information in the serialized convolution problem is optional.
Specifically the following function needs to be modified:
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/6d539ee81321121570606e4ef62e6d072775bbd8/src/conv/problem_description.cpp#L141
The optional stride information should be added after optional group count. You are free to choose the format, but please make it human readable and try to save space (for example, you can serialize only non-packed dimensions)
- Note: The last (the rightmost in the layout string) dimension is non-packed when its stride is not 1. Other dimensions: the i-th dimension is non-packed when its stride is NOT equal to the product of the (i+1)-th dimension by the (i+1)-th stride.
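The rule in the note above can be sketched as a small function. `NonPackedDims` is a hypothetical name; lengths and strides are assumed to be given in layout order with the rightmost dimension last, and to have equal size.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the indices of non-packed dimensions. The last dimension is packed
// when its stride is 1; dimension i is packed when its stride equals the
// length of dimension i+1 multiplied by the stride of dimension i+1.
inline std::vector<std::size_t> NonPackedDims(const std::vector<std::uint64_t>& lens,
                                              const std::vector<std::uint64_t>& strides)
{
    std::vector<std::size_t> result;
    const std::size_t n = lens.size();
    for(std::size_t i = 0; i < n; ++i)
    {
        const std::uint64_t expected = (i + 1 == n) ? 1 : lens[i + 1] * strides[i + 1];
        if(strides[i] != expected)
            result.push_back(i);
    }
    return result;
}
```

For lengths {2, 3, 4, 5}, the strides {60, 20, 5, 1} are fully packed, {120, 40, 10, 2} report only the last dimension, and {70, 20, 5, 1} report only dimension 0, so serializing just the reported dimensions keeps the key short.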
Similar changes need to be made in
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/6d539ee81321121570606e4ef62e6d072775bbd8/src/conv/problem_description.cpp#L106
The above function is used as a key for the Invoker cache.
The difference from ProblemDescription::Serialize is that there is no need to keep backward compatibility, so stride information can be added unconditionally. However, I would recommend reusing the same approach, just to keep the string shorter when all tensors are packed.
/cc @JehandadKhan @DrizztDoUrden @CAHEK7
💡 UPDATE: Decided to add only an optional "non-packed" flag. See #2334 (comment) for more info.
@amberhassaan To continue #2334 (review): You can find the implementation of similar stuff at #2413, I highly recommend glancing at it.
@amberhassaan To continue #2334 (review): We have to update the SQL queries with stride information. See #2413 (comment) which is about similar problem. This is not urgent for now, because existing solvers that can handle non-packed tensors are not tunable. But once we have at least one it will become a blocker immediately. We have to make some decision (e.g. ignore, postpone, implement right here, implement in the follow up PR) before merging this. UPDATE: [resolved] Decided to postpone, #2423 opened. 💡 UPDATE: Decided to add only an optional "non-packed" flag. See #2334 (comment) for more info.
@amberhassaan Please address the ProblemDescription issue in a subsequent PR.
@@ -142,14 +142,27 @@ ConvSolution ConvDirectNaiveConvBwd::GetSolution(const ExecutionContext& ctx,
    }();
    kernel.comp_options = ConvDirectNaiveConvCompileOption(ctx, problem);

    int G_stride_idx = conv_internal::GetGroupStrideIndex(problem);
[Recommendation] using namespace conv_internal; somewhere above.
This is continuation of #2334 (comment)
@@ -54,5 +60,92 @@ bool IsOutputBfp16(const ProblemDescription&);
bool IsOutputInt8(const ProblemDescription&);
bool IsOutputInt32(const ProblemDescription&);

namespace conv_internal {
[Recommendation] conv_naive
This is continuation of #2334 (comment)
result.invoker_factory = [=](const std::vector<Kernel>& kernels) {
    const auto kern = kernels[0];
    return [=](const Handle& handle, const AnyInvokeParams& primitive_parameters) {
        decltype(auto) data_ctx = primitive_parameters.CastTo<conv::DataInvokeParams>();
        const auto& tensors = data_ctx.tensors;
        float elapsed = 0;

        auto in_strides = MakeStrideArray<6>(
If someone calls the "same" convolution but with different strides, this is actually a different problem config, i.e. a different convolution. The library shall create a dedicated Invoker for it.
UPDATE:
In discussion with @JehandadKhan, we decided not to include strides of non-packed tensors to the database keys. Only an optional flag (saying that at least one tensor is non-packed) should be included there. The above means that databases will share the same find-db records, same Invoker instances and same perf-db information for the non-packed convolutions that differ only in strides.
The above design should work correctly provided that:
- If an Invoker instance is able to compute some non-packed convolution, then the same instance must be able to compute any similar non-packed convolution that differs only in strides.
  - [Consequence 1] An Invoker that is used for computation of non-packed convolutions must read stride information from DataInvokeParams (passed to Invoker) and pass it to the kernels as run-time parameters.
  - [Consequence 2] Stride information should not be used for building any compile-time parameters.
- If a kernel that is used to compute a non-packed convolution requires tuning, then the same tuning parameters must provide similar performance for non-packed convolutions that differ only in strides.
From now on, the above is becoming a part of the Solver/Solution/Invoker specification.
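A minimal sketch of the key design described above, assuming a toy problem type. `ToyProblem`, `IsPacked`, `MakeKey` and the "_nonpacked" suffix are all hypothetical; the actual format of the flag is not specified in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical problem description used only for this sketch.
struct ToyProblem
{
    std::string base_key;               // serialized sizes, layout, data type, ...
    std::vector<std::uint64_t> lens;    // layout order, rightmost dimension last
    std::vector<std::uint64_t> strides; // same order as lens
};

// True when every stride equals the product of all lengths to its right.
inline bool IsPacked(const ToyProblem& p)
{
    std::uint64_t expected = 1;
    for(std::size_t i = p.lens.size(); i-- > 0;)
    {
        if(p.strides[i] != expected)
            return false;
        expected *= p.lens[i];
    }
    return true;
}

// The key gains a suffix only for non-packed problems, so keys of packed
// problems stay unchanged. The strides themselves are left out of the key
// and would be passed to the kernel as run-time arguments instead.
inline std::string MakeKey(const ToyProblem& p)
{
    return IsPacked(p) ? p.base_key : p.base_key + "_nonpacked";
}
```

Two non-packed problems that differ only in strides map to the same key and therefore share one Invoker, which is why that Invoker has to read the strides at call time rather than bake them into compile-time parameters.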
@@ -73,7 +73,7 @@ static boost::filesystem::path HipBuildImpl(boost::optional<TmpDir>& tmp_dir,
    auto env = std::string("");

    if(params.find("-std=") == std::string::npos)
-        params += " --std=c++11";
+        params += " --std=c++17";
This affects all the HIP kernels included into the library (that do not explicitly specify C++ standard) and therefore requires extensive testing (and should be done in a separate PR at least). Please revert this and produce the necessary option in the naive solvers.
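One way to keep the change local is to add the standard flag to a single solver's compile options instead of the global default. The sketch below uses a hypothetical helper, `WithCxxStandard`, and mirrors the "only if no -std= is present" check from the diff above.

```cpp
#include <string>

// Returns compile options with a C++ standard flag appended only when the
// caller has not already chosen one. Intended to be applied to the options of
// one kernel rather than to a library-wide default.
inline std::string WithCxxStandard(std::string options, const std::string& standard = "c++17")
{
    if(options.find("-std=") == std::string::npos)
        options += " -std=" + standard;
    return options;
}
```

A naive solver could pass its own option string through such a helper before building, leaving every other HIP kernel on the existing default.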
@junliume @JehandadKhan Note that as #2334 (review) is not resolved, then |
* fix-build-old-rocblas-no-ck(03) Update fin up to the most recent commit in develop * fix-build-old-rocblas-no-ck(06) Resolve review comment * [Workaround] Issue 2496 - disabling the unit test case in wrw solver (#2497) * [NFC] Replace miopen::ProblemDescription with conv::ProblemDescription, part 4 (#2410) * Bump rocm-docs-core from 0.26.0 to 0.27.0 in /docs/.sphinx (#2501) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.26.0 to 0.27.0. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.26.0...v0.27.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update the SyncDB tests to use multi-threading (#2407) * Workaround for issue #2492 - disable ConvBinWinoRxS when granularity loss is huge (#2507) * workaround_issue_2492(01) Disable ConvBinWinoRxS when granularity loss is > 0.995 (performance drops 200 times) * workaround_issue_2492(02) Allow disabing the W/A by setting MIOPEN_DEBUG_WORKAROUND_ISSUE_2493=0 in the env. * Workaround for issue #2492 part 2 (improvement) (#2510) * workaround_issue_2492(01) Disable ConvBinWinoRxS when granularity loss is > 0.995 (performance drops 200 times) * workaround_issue_2492(02) Allow disabing the W/A by setting MIOPEN_DEBUG_WORKAROUND_ISSUE_2493=0 in the env. * workaround_issue_2492(03) [debug] Disable MIOPEN_DEBUG_WORKAROUND_ISSUE_2493 during driver warm-up. * workaround_issue_2492(04) [quality] Make the compuation of max granularity loss more clear. * workaround_issue_2492_01(02) [debug] Log granularity loss when ConvBinWinogradRxSf2x3* solver is skipped. 
* workaround_issue_2492_01(03) [tests] test_db_sync: Disable WORKAROUND_ISSUE_2493 via environment. Support reading legacy fdb (WORKAROUND_ISSUE_1987). Allow FDB testing on gfx1030 (SKIP_KDB_PDB_TESTING). Add W/A for ConvOclDirectFwdFused on gfx1030. Print number of failures per testing thread. * workaround_issue_2492_01(04) Remove leftovers from gfx1030 testing * workaround_issue_2492_01(05) More gfx1030 leftovers removed * Find 2.0 problem fusing (#2466) * [Windows] fix for end of line issue on Windows (#2515) Co-authored-by: Jun Liu <Liu.Jun@amd.com> * [Windows] cmake: strip mingw32 support and cross-compilation out (#2352) * [Windows] cmake: cleanup outdated code (#2349) * [Windows] cmake: make building tests optional (#2351) * [Tests] Relocate Googletest to dev requirements (#2512) * Step 0: build pass but tests fail * Make test build passes * Adopt reviewer comments * Integrate CK's layer norm into MIOpen solver (#2481) * [Tests] Limit layernorm CK test applicability (#2528) * Bump rocm-docs-core from 0.27.0 to 0.28.0 in /docs/.sphinx (#2534) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.27.0 to 0.28.0. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.27.0...v0.28.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [Windows] cmake: a few fixes for multi-config generators (#2357) * [clang-tidy] Use config file for clang-tidy configuration (#2489) * [Tests] split teardown to runtest and verify in layernorm gtest (#2535) split teardown to runtest and verify * [Enhancement] Add checks on workspace params (#2498) * added checks on workspace params * addressed review comments * fix release build warning * Revert "[Enhancement] Add checks on workspace params (#2498)" This reverts commit aa878a8. * [HOTFIX] Fix offline HIP builds after PR #2357. (#2544) * [conv] Remove clamping to MAX from CastTensor used in Bwd and WrW (#2538) * conv-bwd-wrw-disable-clamping(01) [wip] Add clamping parameter to CastTensor() and set it to proper value. * conv-bwd-wrw-disable-clamping(02) Clamping in SubTensorOpWithCastTensor1d. * conv-bwd-wrw-disable-clamping(03) Clamping in SubTensorOpWithCastTensor2d/3d/4d/5d * conv-bwd-wrw-disable-clamping(04) Removed WORKAROUND_ISSUE_2496 * [CK] Bump CK commit hash (#2540) * [Windows] cmake: use built in operator to component-wise version comparison (#2353) Co-authored-by: Jun Liu <Liu.Jun@amd.com> * [Hotfix] when MLIR is not used in MIOpen (#2549) * [Doc] Standardize documentation for ReadtheDocs (#2548) Relates to ROCm/rocm-docs-core#330 * [quality] Fix: always define MIOPEN_LIBMLIR_SUPPORTS_GFX103X_DEFAULT (#2552) * ConvOclDirectFwdGen: Fixed out-of-bounds memory access (#2546) * Find 2.0 fusion (#2486) * [Doc] Remove dated comments in test CmakeLists.txt (#2551) * Bump cryptography from 41.0.4 to 41.0.6 in /docs/sphinx (#2561) Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6. - [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst) - [Commits](pyca/cryptography@41.0.4...41.0.6) --- updated-dependencies: - dependency-name: cryptography dependency-type: indirect ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [Test] fix gemm driver dataType initialization (#2558) fix gemm dataType initialization and make gemm driver more dataType friendly * Environment variables update (#2514) * Use split CK libraries. (#2526) * Sum enhancement in case of inner dim reduce (#2543) * Initialize sum, modify layernorm * FLOAT to FLOAT_ACCUM in kernel, fix kernel index and host test and, split teardown to runtest and verify * remove unused var, int64_t to size_t, add two kernel profile, fix kernel index error, change reqd_work_item_cnt * Use GetMaxComputeUnits, fix GetSumWorkspaceSize flow * Add doxygen, add test case * remove MIOPEN_BETA_API * modify tolerance, add solver list * alignment * add IsImprovementOverROCm, reduce to sqrt(reduce), modify test case * throw to return false in performance check, duplicate code to function, fix wrong allocate memmory size * add experimental caution in doc, add memory copy check in driver, add detail in verify result of driver * modify tolerance * modify get input in driver * [Windows] cmake: replace UNIX with NATIVE command for separate_arguments() (#2555) * [Find 2.0] Bias for Find 2.0 fusion (#2525) * [HotFix] Env Var set conflicts between #2543 and #2514 (#2571) * [Doc] Bump rocm-docs-core from 0.29.0 to 0.30.0 in /docs/sphinx (#2572) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.29.0 to 0.30.0. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.29.0...v0.30.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Sam Wu <sam.wu2@amd.com> * [Doc] document NHWC convolution support (#2575) * Forward, backward data and backward weight convolution solver with fp8/bfp8 compute datatype. (#2531) * [HotFix] disable the f8 test cases that failed the f8 reference kernel in gtest (#2576) * [HotFix] Disable f8 gtest cases that might cause CI fails. (#2577) * [CK] Bump CK commit hash for staging (#2581) * Fix the f8 reference kernel issue that failed CI (#2586) * Patch necessary to make FP8 convolution compile with hiprtc (#2584) * [Doc] Bump rocm-docs-core from 0.30.0 to 0.30.1 in /docs/sphinx (#2589) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.0 to 0.30.1. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.30.0...v0.30.1) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [DOC] Doxygen change: enable warning as error msg and add missing API comments (#2585) * AI Based Parameter Prediction Model for conv_hip_igemm_group_fwd_xdlops Solver (#2523) * [HotFix] KDB Files should not be in the runtime package (#2591) * [Doc] Adding issue template (#2590) * [Doc] Add documentations for non-packed tensors convolution (#2537) * edit document of convolution * address comments --------- Co-authored-by: Jun Liu <Liu.Jun@amd.com> * [Doc] Fix broken links in README.md (#2595) * Add nightly update workflow (#2579) * Tests for RNN seq API (#2493) * [HotFix] Fix Windows build with disabled CK (after #2523) (#2598) * Properly guard CK usage by MIOPEN_USE_COMPOSABLEKERNEL defines * Update src/solver/conv_hip_implicit_gemm_grouped_fwd_xdlops.cpp Co-authored-by: Artem Tamazov <artem.tamazov@gmail.com> --------- Co-authored-by: Jun Liu <Liu.Jun@amd.com> Co-authored-by: Artem Tamazov <artem.tamazov@gmail.com> * [MIOpenDriver] Enabled gemmfp16. [tests] Added smoke test for fp16 and fp32 gemm. (#2592) * fix-gemmfp16(01) [MIOpenDriver] Enable gemmfp16 in the driver * fix-gemmfp16(02) [tests] Add smoke test for fp16 gemm * [Doc] Fix URLs (ROCmSoftwarePlatform -> ROCm) in the doc, comments, and code. + more (#2597) * Update URLs (ROCmSoftwarePlatform -> ROCm) in the documentation and comments in the source code. * (2) Update URLs (ROCmSoftwarePlatform -> ROCm) in the documentation and comments in the source code. 
* Fix incorrect link * Fix links * [HotFix] Bump CK commit hash for F8 patch (#2603) * [Doc] Fix broken links in CONTRIBUTING.md (#2601) * Fix broken rocmsoftwareplatform.github.io links in CONTRIBUTING.md * Use new organization name for repoistory links * [Windows] use find_package() for Eigen and frugally-deep (#2574) * [Windows] enable compilation on Windows (#2570) * [HotFix] 3D Group Conv Backward data and weight update. Failure noticed when pads and strides are not 1 (#2560) * [CMake] fix find_package(... GLOBAL) for CMake < 3.24 (#2610) * [HotFix][atamazov] multiple undefined behavior discovered with -fsanitize=undefined in DEV builds (#2609) * fix-issue-2602(01) Fix for smoke_miopendriver_gemm * Do not print output parameters in MIOPEN_LOG_FUNCTION calls. --------- Co-authored-by: atamazov <artem.tamazov@gmail.com> * [hipRTC] resolve symbol issues by explicitly link with hipRTC (#2612) * explicitly link with hipRTC * Update formatting * Consider MIOPEN_USE_HIPRTC=Off * Clean up --------- Co-authored-by: Jun Liu <Liu.Jun@amd.com> * Standardize workspace abstraction (#2524) * [gtest] conversion for code coverage tests (#2580) * [HotFix] revert #2580 and re-enable smoke tests (#2616) * Revert "[gtest] conversion for code coverage tests (#2580)" This reverts commit c5a2384. * re-enable smoke tests in CI * remove problematic github action * [Windows] use find_package() for SQLite3 (#2564) * [Doc] Bump rocm-docs-core from 0.30.1 to 0.30.2 in /docs/sphinx (#2620) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.1 to 0.30.2. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.30.1...v0.30.2) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [Windows] use official ZStd package from Facebook (#2565) * Remove MIOpenGEMM and MIOpenTensile leftovers (#2499) * Remove FIN_OLD_PROBLEM_DESCRIPTION_COMPAT (#2503) * [Jenkins] Add NOMLIR stage. [Workaround] Limit usage of gfx908 nodes in non-nightly builds (#2622) * Get rid of legacy 2GiB offset limits in CallGemm*() and transpose*() internal APIs and kernels. (#2613) * [BugFix] Proper fix for backward passes bwd/wrw for CK group conv 3d (#2619) * [BugFix] asm igemm fwd kernel will have computation error when c <=4 and dilation_y > 1, workaround (#2625) * Fused solver for Fwd Convolution with Residual add, Bias add and then activation function (#2517) * Bump MIOpen version to 3.1.0 and update CI docker (#2519) * [HotFix] resolve unknown type issue after #2517 (#2629) * [Doc] Bump rocm-docs-core from 0.30.2 to 0.30.3 in /docs/sphinx (#2628) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.2 to 0.30.3. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.30.2...v0.30.3) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [HOTFIX] Fix build with -DMIOPEN_USE_COMPOSABLEKERNEL=Off after #2517. (#2630) * [Jenkins][Tests] Add stage with -DMIOPEN_USE_COMPOSABLEKERNEL=Off after #2517 #2630. (#2631) * [HOTFIX] Fix build with -DMIOPEN_USE_COMPOSABLEKERNEL=Off after #2517. 
* add -DMIOPEN_USE_COMPOSABLEKERNEL=Off stage * make NOCK stage anyAPU and build ONLY * Adopt recommended changes * rename config_targets to make_targets * Extend GTest DISCOVERY_TIMEOUT to 5 mins * [Tests] add unit test for #2624 (#2632) * [gtest] Combine gtests into single binary. (#2599) * [Windows] rocblas: disable Beta API on Windows for HIP < 5.7 (#2405) * [tests] Limit applicability of ConvFwdBiasActivAPI/ConvFwdBiasResAddActivTest.ConvFusedAPI (#2635) * [Tests] helper for evn variables update in gtests (#2605) Co-authored-by: xinlipn <xinlipn@gmail.com> * [Windows] fix compilation of math functions on Windows (#2568) * [Windows] fix printf type incompatibility between type specifiers (#2569) * Fix miopen package dependency roctracer etc (#2508) * [Doc][NFC] added rocm v6, mi300, and default component (#2618) * [Windows] add a class to allow os-agnostic process execution (#2567) * [Windows] make BZip2 a required package (#2566) * [Windows] add missing symbol export (#2556) * add missing symbol export * more missing exports * fix format issues --------- Co-authored-by: Jun Liu <Liu.Jun@amd.com> Co-authored-by: Alex Eremin <CAHEK7@yandex.ru> * [ROCm 6.1][hipRTC] Fix build failures. [quality] Reorg standard includes in HIP sources. (#2637) * [WORKAROUND] Disable W/A for issue #1359 starting from ROCm 5.4.3. (#2225) Co-authored-by: Jun Liu <Liu.Jun@amd.com> * [Dep] Bump CK commit hash for staging (#2640) * [Windows] default paths to user and system db files on Windows (#2365) * Fix COMgr dependency in MIOpen package (#2645) * [ROCm 6.0.1] Automatically activate the new HIPRTC PCH adaptations starting from the 6.0.24000 version. Fix some build errors. (#2644) * Automatically activate the new HIPRTC PCH adaptations starting from the 6.0.24000 version. Fix some build errors (#2465 + more) (cherry picked from commit 4f695d9) * Remove duplicated includes. 
* [HOTFIX] Adapt to changes in HIP Mainline 417 (possibly future 6.1 RC) (#2652) * fix-rocm61rc417(01) Disable new kernel build warnings. [NFC] Sort headers properly. * fix-rocm61rc417(02) [ROCm 6.1][HIPRTC] Use custom implementations instead of standard <limits>. This fixes build issues with ROCm 6.1. * fix-rocm61rc417(03) [ROCm 6.1][HIPRTC][Bugfix] Fixed issue in miopen_limits.h that prevented the use of custom implementations. * fix-rocm61rc417(04) [ROCm 6.1 RC][HIPRTC] Disable some of the custom implementations from <type_traits> (like `integral_constant`) for HIP mainline 417. This fixes some build issues. * fix-rocm61rc417(05) [ROCm 6.1 RC][offline compiler] Removed "-mcpu" from build options. This resolves kernel build issues with HIP mainline 417 (offline compiler). Improved diagnostic messages output onto console after offline build failures. * fix-rocm61rc417(06) [tests] Disable some testcase from handle_test as #2600 still persists in Hip Mainline 417. --------- Co-authored-by: Jun Liu <Liu.Jun@amd.com> * Correct parameter which prints unused flag in log fusion cmd (#2653) * [MI300][Tuning] Tunings for SWDEV tickets (#2654) * add initial tunings for mi300 * add test to db_sync * [ROCm 6.0.1] Fix merge error in #2652 that affects #2644. (#2658) * [CK] Bump CK commit hash for staging (#2659) * Bump gitpython from 3.1.37 to 3.1.41 in /docs/sphinx (#2662) Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.37 to 3.1.41. - [Release notes](https://github.com/gitpython-developers/GitPython/releases) - [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES) - [Commits](gitpython-developers/GitPython@3.1.37...3.1.41) --- updated-dependencies: - dependency-name: gitpython dependency-type: indirect ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump jinja2 from 3.1.2 to 3.1.3 in /docs/sphinx (#2666) Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3. - [Release notes](https://github.com/pallets/jinja/releases) - [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst) - [Commits](pallets/jinja@3.1.2...3.1.3) --- updated-dependencies: - dependency-name: jinja2 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [Doc] Updated links to ROCm Repositories (#2667) Changed <old-organization> to "ROCm". * [SWDEV-433582] Search-proofed PrepareInvoker (#2661) * [HotFix] fix clang format issue from #2661 * [FIN] update submodule (#2660) * [Windows] replace [[gnu::noreturn]] with [[noreturn]] (#2656) * [Windows] addkernels: fix operations on path for Windows (#2657) * [Windows] clean up the setting of environment variables cross-platform (#2655) * clean up the setting of environment variables cross-platform * fix clang-tidy * Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx (#2676) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.3 to 0.31.0. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.30.3...v0.31.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Argmax enhancement in case of inner dim reduce (#2583) * [Test] Convert conv_igemm_dynamic_dlops etc to gTest (#2553) * [Bugfix] Restore Missing ctests (#2649) * [Windows] fix compilation on Windows (#2677) * [Windows] cmake: unpack kernels into a build directory (#2347) * Remove FIN_OLD_HANDLE_COMPAT and FIN_OLD_BINARY_CACHE_COMPAT (#2627) * Rename transpose* kernels (leftover of #2613) (#2673) * [CK] Bump CK commit hash for staging (#2683) Update CK to the latest staging * [zlib] Update rocm-recipes for more reliable zlib link (#2686) * [OCL] Use OpenCL 2.0 while compiling kernels (#2691) * Fix compilation on SELS/RHEL after #2657 merged (#2690) * [BF16][FP8][BF8] Fixed some specializations from `<limits>` and `<cmath>` (#2669) * conv::ProblemDescription: remove underscores, change return data type (#2685) * Add 2D Group Convolution Backward Data and Weights update solvers. Simplify and unify 3d group conv tests (#2663) * [HOTFIX] Disable "granularity loss" W/A for #2492 and add a new, "tiny tensor" based one. (#2695) * disable 2492 granularity_loss workaround and enable tiny_tensor workaround * workaround_issue_2492_02(01) Macros to uppercase. Add doc for WORKAROUND_ISSUE_2492_TINY_TENSOR. Add conditions N<=4 and C<=4 to the "tiny tensor" W/A. Disable it during warmup, make it controllable by MIOPEN_DEBUG_WORKAROUND_ISSUE_2492. * Update src/solver/conv_winoRxS.cpp --------- Co-authored-by: Jun Liu <Liu.Jun@amd.com> * [Clang-Format] Fix format issue * Bump rocm-docs-core from 0.31.0 to 0.32.0 in /docs/sphinx (#2699) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.31.0 to 0.32.0. 
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.31.0...v0.32.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [MI300][Tuning] gold 20 (#2697) * add gfx942 superbench winograd tunings, update gold version to 20 * update with more superbench tunings * Remove support for ROCm < 5.6.0 (#2665) * Remove support for ROCm < 5.7.0 * deprecate-rocm-less-5.7(03) Leftover that fixes build error with "-Werror" * deprecate-rocm-less-5.7(04) Resolve review comment * Bump rocm-docs-core from 0.32.0 to 0.33.0 in /docs/sphinx (#2707) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.32.0 to 0.33.0. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.32.0...v0.33.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [COMGR] Use OpenCL 2.0. [HIPRTC] Provide min/max limits for int. Fix build errors related to min/max limits for BF16. (#2705) * fix-rocm-mainline-issues-01(01) Removeed `constexpr` from numeric_limits<hip_bfloat16>::min()/max() as BF16 ctor provided by HIP can't be used in const expressions. 
* fix-rocm-mainline-issues-01(02) [COMGR] Globally engage OpenCL 2.0 * fix-rocm-mainline-issues-01(03) [HIPRTC] Provide min/max limits for int * [DOC] fix broken links in docs (#2696) * lwpmiopen_521_correct_doc_issues: fix broken links in docs * lwpmiopen_521_correct_doc_issues: remove citing * [HotFix] Fix DB install after #2347 (#2702) --------- Co-authored-by: Artur Wojcik <artur.wojcik@outlook.com> Co-authored-by: Artur Wojcik <artur.wojcik@amd.com> * Add GroupNorm forward operation (#2623) * fix not reporting LFS missing files (#2710) * [HotFix][WHL] move the bfloat16 header to the proper guard (#2711) * [HotFix] Update FindDB for finetuning (#2712) * [CK] Update CK commit in requirements.txt for staging (#2713) * [Tests] Fix Gtest single executable build issue (#2715) (#2717) Add the missing build job to Jenkinsfile Fix duplicate class name issue in Gtest * [Windows] Do not use HIP runtime headers on Windows (#2719) * don't use WORKAROUND_DONT_USE_CUSTOM_LIMITS on Windows * don't use workaround SWDEV_413293 on Windows * CI base docker updates to ROCm 6.0.2 (#2714) * Softmax ocl refactoring (#2671) * Add cat forward operation (#2562) * [HotFix] Fix namespace conflict issue in gtest after #2562 (#2725) * Bump cryptography from 41.0.6 to 42.0.0 in /docs/sphinx (#2729) Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.6 to 42.0.0. - [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst) - [Commits](pyca/cryptography@41.0.6...42.0.0) --- updated-dependencies: - dependency-name: cryptography dependency-type: indirect ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [DB Install] fix installation of *.fdb.txt and *.db files (#2728) * Update CHANGELOG.md (#2720) * bg/update_change_log_lwpmiopen_501: update change long till rocm 6.1.0 (MIOpen-3.1.0) * bg/update_change_log_lwpmiopen_501: remove typo * bg/update_change_log_lwpmiopen_501: fix broken link * bg/update_change_log_lwpmiopen_501: second attempt to fix hyper link * Create placeholder CODEOWNERS (#2718) Add @JehandadKhan and @junliume as CODEOWNERS. * [Solvers] Fix for #2663 ensure tensor dimensions are consumed by solvers correctly (#2716) * [DOC] Add codeowners for documentation (#2692) * Add codeowners for documentation * Update CODEOWNERS --------- Co-authored-by: samjwu <samjwu@users.noreply.github.com> Co-authored-by: Jun Liu <Liu.Jun@amd.com> * Bump rocm-docs-core from 0.33.0 to 0.33.2 in /docs/sphinx (#2733) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.33.0 to 0.33.2. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.33.0...v0.33.2) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix build after #2657 and #2690 (boost::filesystem) (#2732) * [Improvements] Replace HasAtLeastOne64BitTensor() with AllTensorsDimsFitIntoInt() (#2731) * Update CK-based 2d/3d convolution solvers to support nchw/ncdhw layout (#2429) * Bump rocm-docs-core from 0.33.2 to 0.34.0 in /docs/sphinx (#2739) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.33.2 to 0.34.0. 
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.33.2...v0.34.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [BugFix] Set System KDB journal_mode to Off (#2724) * [Tests] Converting test_conv3d_extra into GTest (#2554) * [Tests] Convert test_rnn_vanilla , test_gru, test_rnn_extra and test_gru_extra gTests (#2550) * [Doc] Removing unmaintained release notes (#2745) * [CK] Update CK commit in requirements.txt for staging (#2747) * [Tests][gtest] conversion for LSTM (#2545) * Fix for issue #2734: Detect if "-fno-offload-uniform-block" works in HIP compiler. (#2743) * fix-issue-2734 (01) Use "-fno-offload-uniform-block" only if HIP compiler supports it. Resolves #2734. (cherry picked from commit 458c833) Partially changes code from PR #2719 "Do not use HIP runtime headers on Windows" # RESOLVED Conflicts: # CMakeLists.txt * fix-issue-2734(02) Removed W/A from PR #2719 as it is no longer needed. * Enable softmax solver based on attention-softmax implementation (#2737) * [Tests] Replace test_conv_igemm_dynamic_xdlops_bwd with gtest (#2409) * [Tests] Convert ctest to gtest for test_conv_for_implicit_gemm (#2513) * [Tuning][MI300] for m9 tickets (#2754) * [hipRTC] add lowest() for float to MIOpen custom limits (#2753) * [hipRTC] add lowest() to MIOpen custom limits * the earliest trace can be found together with numeric_limits<int> * [Linux] Enhance Compiler flags to avoid Hardcoded ROCm Path (Part 1) (#2694) * Bump rocm-docs-core from 0.34.0 to 0.34.2 in /docs/sphinx (#2755) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.34.0 to 0.34.2. 
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.34.0...v0.34.2) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Artur Wojcik <artur.wojcik@outlook.com> Co-authored-by: Artur Wojcik <artur.wojcik@amd.com> Co-authored-by: Vasilii Filippov <DrizztDoUrden@users.noreply.github.com> Co-authored-by: JD <jahandad@gmail.com> Co-authored-by: xinlipn <xinlipn@gmail.com> Co-authored-by: Jun Liu <Liu.Jun@amd.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: mentat <108366729+bghimireamd@users.noreply.github.com> Co-authored-by: Reid Kawaja <74506315+reidkwja@users.noreply.github.com> Co-authored-by: amberhassaan <amber_474@yahoo.com> Co-authored-by: Artem Tamazov <artem.tamazov@gmail.com> Co-authored-by: Daming Feng <dmfeng8898@gmail.com> Co-authored-by: Alex Eremin <CAHEK7@yandex.ru> Co-authored-by: Evgenii Averin <86725875+averinevg@users.noreply.github.com> Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com> Co-authored-by: M. 
Saud Ul Hassan <68208941+msaudulhassan@users.noreply.github.com> Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: saeid-rostami <123997133+saeid-rostami@users.noreply.github.com> Co-authored-by: Seungman Han <120356720+seungmanhan@users.noreply.github.com> Co-authored-by: Sam Wu <sam.wu2@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Umang Yadav <29876643+umangyadav@users.noreply.github.com> Co-authored-by: Dmantri98 <109552294+Dmantri98@users.noreply.github.com> Co-authored-by: abhimeda <138710508+abhimeda@users.noreply.github.com> Co-authored-by: xu-shawn <50402888+xu-shawn@users.noreply.github.com> Co-authored-by: Kamil Nasyrov <shurale.nkn@gmail.com> Co-authored-by: jasberc <146053952+jasberc@users.noreply.github.com> Co-authored-by: David Galiffi <dgaliffi@amd.com> Co-authored-by: Kyeonghwan Ryu <89056320+kyeonghwanryu@users.noreply.github.com> Co-authored-by: scerzh <102019268+scerzh@users.noreply.github.com> Co-authored-by: Vsevolod Golovko <vsevolod.golovko2@dxc.com> Co-authored-by: Jungkeun Kim <et16kr@gmail.com> Co-authored-by: samjwu <samjwu@users.noreply.github.com> Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com> Co-authored-by: M.Emin Ozturk <ozturk.27@osu.edu> Co-authored-by: arvindcheru <90783369+arvindcheru@users.noreply.github.com>
The current reference kernels used to verify convolution (the naive direct convolution kernels) work only with packed tensors. This PR adds variants of these kernels that work with non-packed tensors by taking custom strides as kernel parameters.
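To illustrate the difference between packed and non-packed addressing, here is a minimal host-side sketch. It is not the code from this PR: the names `Strides5D`, `PackedStrides`, and `Offset` are hypothetical, and the NCDHW dimension order is an assumption made for the example. It uses 64-bit unsigned indices throughout so that offsets do not overflow on very large tensors.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper types and names for illustration only; not MIOpen's API.
// Element strides for the N, C, D, H, W dimensions, in that order (assumed layout).
using Strides5D = std::array<std::uint64_t, 5>;

// Strides of a fully packed tensor: each stride is the product of the
// lengths of all faster-varying dimensions.
inline Strides5D PackedStrides(const std::array<std::uint64_t, 5>& lens)
{
    Strides5D strides{};
    std::uint64_t running = 1;
    for(std::size_t i = 5; i-- > 0;)
    {
        strides[i] = running;
        running *= lens[i];
    }
    return strides;
}

// Linear element offset for a 5D index. With explicit strides the same
// function covers both packed and non-packed (padded or sliced) tensors.
inline std::uint64_t Offset(const Strides5D& strides,
                            const std::array<std::uint64_t, 5>& idx)
{
    std::uint64_t offset = 0;
    for(std::size_t i = 0; i < 5; ++i)
        offset += idx[i] * strides[i];
    return offset;
}
```

A kernel that hard-codes packed strides can only address tensors for which `PackedStrides` would be correct. Passing the strides in as parameters, as this PR describes, lets the same indexing logic handle a tensor whose rows are padded, for example an H stride of 8 when W is only 6.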